Department of Statistics & Data Science
Carnegie Mellon University
October 28, 2024
Figuring out which genetic variants are most responsible for a disease is difficult.
Suppose there is an outcome \(Y_i \in \{0, 1 \}\) with relationship \[ Y_i \sim \text{Bern}(p_i), \qquad \log\left(\frac{p_i}{1 - p_i}\right) = X_i \beta \] where \(X_i \in \{0,1,2\}^d\) for \(i \in \{1, \dots, n \}\) and \(\beta\) is sparse.
Goal: We want to find the non-zero entries of \(\beta\).
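To make this setting concrete, here is a minimal simulation sketch in R. It is our own illustration, not code from the thesis: the SNPs are drawn independently (no linkage structure), and the allele frequencies and coefficients are arbitrary.

```r
# Simulate genotypes coded 0/1/2 and a binary outcome from a
# sparse logistic model. All numeric values are illustrative.
set.seed(1)
n <- 200
d <- 8
maf  <- runif(d, 0.1, 0.5)                        # minor-allele frequencies
X    <- sapply(maf, function(f) rbinom(n, 2, f))  # 0/1/2 allele counts
beta <- c(1.5, 0, 0.8, 0, 0, 0.5, 0, 0)           # sparse truth
y    <- rbinom(n, 1, plogis(X %*% beta - 1))      # Bern(p_i) outcomes
```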
Suppose there is an outcome \(Y_i \in \mathbb R\) with relationship \[ \mathbb E\left(Y_i|X_i\right) = X_i \beta \] where \(X_i \in \{0,1,2\}^d\) for \(i \in \{1, \dots, n \}\) and \(\beta\) is sparse.
Goal: We want to find the non-zero entries of \(\beta\).
|   | y | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6 | SNP7 | SNP8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 2 | 0 | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 2 |
| 3 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 2 |
| 4 | 0 | 1 | 2 | 2 | 0 | 1 | 1 | 1 | 2 |
| 5 | 0 | 0 | 2 | 2 | 1 | 1 | 0 | 1 | 2 |
| 6 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 7 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 1 | 2 |
| 8 | 0 | 1 | 2 | 2 | 0 | 1 | 1 | 1 | 2 |
| 9 | 0 | 1 | 2 | 2 | 0 | 1 | 1 | 1 | 2 |
With these correlation structures, consistent model selection is impossible.
Suppose we have data with 4 features in a 2-block design
and the following relationship
\[ \mathbb E(Y_i|X_i) = X_{i,1} + X_{i,3} \] but \(\textrm{Cor} \left( X_{i,1}, X_{i,2} \right) = \textrm{Cor} \left( X_{i,3}, X_{i,4} \right) = \rho\) is very high.
The best any procedure can do is choose an element from each of \(\{X_{\bullet,1}, X_{\bullet,2} \}\) and \(\{X_{\bullet,3}, X_{\bullet,4} \}\).
Suppose we have data with 9 features in a 3-block design and
the following relationship
\[ \mathbb E(Y_i|X_i) = 3 X_{i,1} + 2 X_{i,4} + X_{i,7} \] but again there is high within-block correlation.
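For intuition, a minimal R sketch of this 3-block design (again our own illustration: Gaussian features stand in for the 0/1/2 genotypes, and the values of \(n\) and \(\rho\) are arbitrary):

```r
# Draw 9 features in 3 blocks of 3 with within-block correlation
# rho, then give signal to features 1, 4, and 7 only.
set.seed(1)
n   <- 500
rho <- 0.95

block <- matrix(rho, 3, 3)           # one block's correlation matrix
diag(block) <- 1
Sigma <- kronecker(diag(3), block)   # 9 x 9 block-diagonal correlation

X <- MASS::mvrnorm(n, mu = rep(0, 9), Sigma = Sigma)
y <- 3 * X[, 1] + 2 * X[, 4] + X[, 7] + rnorm(n)
```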
Let each index correspond to the columns of \(\mathbf X\) where \[ \mathbf X = \begin{pmatrix} X_{1,1} & X_{1,2} & \dots & X_{1, d} \\ X_{2,1} & \ddots & & \vdots \\ \vdots & & \ddots & X_{n-1, d}\\ X_{n, 1} & \dots & X_{n, d-1} & X_{n,d} \end{pmatrix} \]
Idea
In the 3-block design, as \(\rho \rightarrow 1\) and \(n \rightarrow \infty\), the probability that each of the first 3 features minimizes the error is \(\frac{1}{3}\).
Alternative idea
Suppose that the first 3 features have equal signal strength. We expect that as \(n \rightarrow \infty\), the probability each of them minimizes the in-sample error tends toward \(\frac{1}{3}\).
If we knew these probabilities, we could simply select all the univariate models with similarly high values.
Consider dataset \(\mathcal D = \left( X_i, Y_i \right)_{i=1}^n\).
Suppose we have \(p\) models that comprise the model set \(\mathcal{M} = \{1, \dots, p\}\).
For each \(m \in \mathcal{M}\), \(\hat f_m(\mathcal{D})\) is a model fitted using dataset \(\mathcal{D}\).
Let \(\ell\) be a loss function.
For each \(r \in \mathcal{M}\), there exists some probability \(\gamma_r\) that \(r\) minimizes the loss. \[ \def\argmin{\mathop{\mathrm{arg\,min}}} \def\argmax{\mathop{\mathrm{arg\,max}}} \gamma_r = \mathbb P \left( r = \argmin_{ m\leq p} \sum_{i=1}^{n} \ell \left( X_i; \hat f_m(\mathcal D) \right) \right) \] If multiple models have the same high probability of selection, we should select all of them.
To estimate these probabilities, draw \(B\) subsamples \(\mathcal D^{1*}, \dots, \mathcal D^{B*}\), each of size \(\tilde n\) with \(\frac{\tilde{n}}{n} \rightarrow 0\). Then, for all \(r \in \mathcal M\), calculate
\[ \hat \gamma_{r} = \frac{1}{B} \sum_{b=1}^B \mathbb{I} \left( r = \argmin_{ m \leq p} \sum_{i=1}^{\tilde n} \ell \left( X_i^{b*}; \hat f_m \left( \mathcal{D}^{b*} \right) \right) \right) \]
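A direct implementation of this estimator for the \(p\) univariate linear models might look like the sketch below. It is a hypothetical implementation: the squared-error loss, the subsample size \(\tilde n = n^{0.7}\), and the use of `lm()` are our choices, with `X` and `y` as in the earlier simulation.

```r
# Estimate gamma_r for each univariate model: refit on B
# subsamples of size n_tilde and count how often each feature
# attains the smallest in-subsample squared-error loss.
gamma_hat <- function(X, y, B = 1000, n_tilde = floor(nrow(X)^0.7)) {
  p <- ncol(X)
  wins <- integer(p)
  for (b in 1:B) {
    idx  <- sample(nrow(X), n_tilde)
    loss <- sapply(1:p, function(m) mean(resid(lm(y[idx] ~ X[idx, m]))^2))
    best <- which.min(loss)
    wins[best] <- wins[best] + 1
  }
  wins / B   # estimated selection frequencies
}
```

On the 3-block data above, the wins split roughly evenly among the three features of the strongest block, matching the \(\frac{1}{3}\) heuristic.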
This produces a set of frequencies, but we need a method capable
of selecting a set of them that’s likely to contain the true best(s).
This is where the ranking and selection literature comes into play.
Using methods from the multinomial ranking and selection literature, we can find a threshold \(\theta\) for which \(\mathbb P( \argmax_q \gamma_q \in \{r : \hat \gamma_r \geq \max_s \hat\gamma_s - \theta \} ) \geq 1 - \alpha\) (Gupta and Nagel 1967; Panchapakesan 1971).
Recall the previous example \[ \mathbb E(Y_i|X_i) = 3 X_{i,1} + 2 X_{i,4} + X_{i,7} \] where each of the above features has 2 highly correlated noise features.
We can apply the resampling procedure to all univariate models, which may return counts like \[ \{\color{#E69F00}{160, 178, 176}, \color{#56B4E9}{104, 92, 119}, \color{#009E73}{50, 61, 60} \}. \]
Then the multinomial procedure returns a set of indices likely to contain the truly most probable selection(s), e.g.
\[ \{1, 2, 3\}. \]
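Once \(\theta\) is in hand, the thresholding step itself is one line. In this sketch \(\theta = 0.03\) is an illustrative value; in practice it comes from the Gupta-Nagel procedure at the desired \(1 - \alpha\).

```r
counts    <- c(160, 178, 176, 104, 92, 119, 50, 61, 60)
gamma_hat <- counts / sum(counts)   # B = 1000 subsamples
theta     <- 0.03                   # illustrative threshold value

which(gamma_hat >= max(gamma_hat) - theta)
#> [1] 1 2 3
```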
Using these tools, we can construct a single set of univariate models. To obtain larger models, we need only embed this within forward selection.
We call this method model path selection (Chapter 2).
```r
# Model path selection (MPS): embed the univariate selection
# procedure inside forward selection, branching whenever more
# than one addition is selected.
current_models <- '1'                  # start from the intercept-only model
for (i in 1:nstep) {
  new_models <- c()
  for (j in seq_along(current_models)) {
    # find_variable_set() runs the resampling + multinomial selection
    # procedure and returns the names of all selected additions
    variable_set <- find_variable_set(current_models[j])
    new_models <- c(new_models,
                    paste0(current_models[j], ' + ', variable_set))
  }
  current_models <- unique(new_models)
}
```
🚨There is no stopping rule!🚨
Theorems 1 and 2 in Chapter 3
For the cross-validation risk estimate \(\hat R_{\textrm{cv}, r}\), let \(v_i\) be the fold containing observation \(X_i\), so that \[ \hat R_{\textrm{cv}, r} = \frac{1}{n}\sum_{i\leq n} \ell\left(X_i; \hat f_r(\mathcal{D}_{-v_i})\right), \] and let the center \(\mu_r\) be either \[ \frac{1}{V} \sum_{v \leq V} \mathbb{E} \left[ \ell(X_0; \hat f_r(\mathcal{D}_{-v})) \,\middle|\, \mathcal{D}_{-v}\right] \quad \text{or} \quad \frac{1}{V} \sum_{v \leq V} \mathbb{E} \left[ \ell(X_0; \hat f_r(\mathcal{D}_{-v})) \right]. \] Then \[ \sup_{z\in\mathbb R} \left|\mathbb P\left(\max_{1\le r\le p}\sqrt{n}(\hat R_{\textrm{cv},r}-\mu_r)\le z\right)-\mathbb P\left(\max_{1\le r\le p} Y_r \le z\right)\right|\rightarrow 0, \] where \(\mathbf Y = (Y_1, \dots, Y_p)\) is a centered Gaussian vector with matching covariance.
Let \(\mathcal D^i\) equal \(\mathcal D\) but with element \(i\) replaced by an iid copy of \((X_i, Y_i)\). For all \(\hat f\), the assumptions require first- and second-order stability of the loss:
\[\begin{align*} \nabla_i \ell &:= \ell(X_0; \hat f (\mathcal D)) - \ell(X_0; \hat f (\mathcal D^i)) \ll n^{-1/2} \\ \nabla_j \nabla_i \ell &\ll n^{-3/2}. \end{align*}\]
Select \(r\) if you fail to reject \[ H_{0, r} : \mu_{r} - \mu_{ s} \leq 0 \] for all \(s\neq r\) (Lei 2020).
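Below is a simplified sketch of this test-based selection, given a hypothetical \(n \times p\) matrix `cv_loss` of per-observation cross-validation losses. For readability it uses uncorrected one-sided paired t-tests; Lei (2020) instead calibrates the maximum statistic with a multiplier bootstrap.

```r
# Keep model r unless some competitor s has significantly lower
# CV loss, i.e. unless we reject H_{0,r}: mu_r - mu_s <= 0.
mcs_select <- function(cv_loss, alpha = 0.05) {
  p <- ncol(cv_loss)
  keep <- sapply(1:p, function(r) {
    pvals <- sapply(setdiff(1:p, r), function(s) {
      t.test(cv_loss[, r] - cv_loss[, s], alternative = "greater")$p.value
    })
    all(pvals > alpha)   # fail to reject H_{0,r} against every s
  })
  which(keep)
}
```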
Note
Both of these methods produce genuinely asymptotically valid confidence sets.
Suppose we have data of the form \[ \mathbb{E}(Y_i|X_i) = X_{i,1} + X_{i,2} \] and \(\mathcal{M} = \{ f_1, f_{1,2} \}\) are linear models that use the first feature and both features, respectively.
As \(n\) grows, \(f_{1,2}\) attains strictly lower risk, so the confidence set eventually contains only \(f_{1,2}\). We will therefore use this in MPS to check whether subsequently added features create significantly better models.
\(\mathcal{M}^{\textrm{node}}_1 = \{ \{1\}, \{1, 3\}, \{1, 4\} \}\), \(\mathcal{M}^{\textrm{node}}_2 = \{ \{5\}, \{5, 1\} \}\)
\(\mathcal{M}^{\textrm{level}}_1 = \{ \{1\}, \{5\}, \{1, 3\}, \{1, 4\}, \{5, 1 \} \}\)
Suppose the CV MCS, applied at each node, selects \(\mathcal{M}^{\textrm{node}*}_1 = \{\{1, 3\}, \{1, 4\} \}\) and \(\mathcal{M}^{\textrm{node}*}_2 = \{ \{5\}, \{ 5, 1\} \}\).
Suppose the CV MCS selects \(\mathcal{M}^{\textrm{level}*}_1 = \{\{1, 3\}, \{1, 4\}, \{5\}, \{5, 1\} \}\)